Peggy Newman, Martin Westgate, Amanda Buyan, Dax Kellie & Shandiya Balasubramaniam

The problem


For researchers, getting data out of GBIF nodes is easy…

…but sharing your own data is hard.

Hurdles


  • Darwin Core Standard formatting isn’t easy (e.g., .xml)
  • Existing documentation isn’t well-suited to newbies
  • Poor integration with existing workflows (e.g., in R or Python)
  • Sharing data is low on the priority list

Q: How can we help researchers share biodiversity data?

galaxias (and friends)


galaxias: Build, check & publish Darwin Core Archives (DwC-A)
corella: Convert a tibble to Darwin Core
delma: Convert markdown to EML or xml

Darwin Core

A Darwin Core Archive (DwC-A) is a .zip file containing three things:

  • data (csv format)
  • metadata (eml format)
  • schema (xml format)
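
As a rough illustration of those three parts, the sketch below checks a folder for one file of each kind before zipping. The file names are taken from the archive listing later in this talk; check_archive_parts() and the demo folder are hypothetical helpers written for this example and are not part of galaxias.

```r
# Hypothetical helper (not a galaxias function): report which of the
# three archive components are present in a folder.
check_archive_parts <- function(dir) {
  expected <- c(data     = "occurrences.csv",
                metadata = "eml.xml",
                schema   = "meta.xml")
  present <- file.exists(file.path(dir, expected))
  names(present) <- names(expected)
  present
}

# Demo in a temporary folder that deliberately lacks the schema file
demo_dir <- file.path(tempdir(), "data-publish-demo")
dir.create(demo_dir, showWarnings = FALSE)
file.create(file.path(demo_dir, c("occurrences.csv", "eml.xml")))

parts <- check_archive_parts(demo_dir)
parts
```

A named logical vector keeps the result easy to read at the console and easy to test with all(parts).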

Process



data → metadata → schema → archive → validate → submit

Data

Create an example dataset

# A tibble: 2 × 5
  latitude longitude date       time  species                 
     <dbl>     <dbl> <chr>      <chr> <chr>                   
1    -35.3      149. 14-01-2023 10:23 Callocephalon fimbriatum
2    -35.3      149. 15-01-2023 11:25 Eolophus roseicapilla   
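
The code that built this tibble is not shown on the slide. A minimal sketch of one way to create it, assuming the tibble package is installed; the coordinate values are illustrative placeholders chosen to match the rounded printout above, not the original data.

```r
library(tibble)

# Coordinates are placeholder values (assumption); the dates, times
# and species names are copied from the printed example.
df <- tibble(
  latitude  = c(-35.310, -35.273),
  longitude = c(149.125, 149.133),
  date      = c("14-01-2023", "15-01-2023"),
  time      = c("10:23", "11:25"),
  species   = c("Callocephalon fimbriatum", "Eolophus roseicapilla")
)

df
```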

Data

How should we convert this dataset to Darwin Core?

suggest_workflow(df)

Data

If we follow that advice:

df_dwc <- df |>
  set_occurrences(occurrenceID = sequential_id(),
                  basisOfRecord = "humanObservation") |> 
  set_coordinates(decimalLatitude = latitude, 
                  decimalLongitude = longitude) |>
  set_datetime(eventDate = lubridate::dmy(date),
               eventTime = lubridate::hm(time)) |>
  set_scientific_name(scientificName = species, 
                      taxonRank = "species")

df_dwc
# A tibble: 2 × 8
  basisOfRecord    occurrenceID decimalLatitude decimalLongitude eventDate 
  <chr>            <chr>                  <dbl>            <dbl> <date>    
1 humanObservation 01                     -35.3             149. 2023-01-14
2 humanObservation 02                     -35.3             149. 2023-01-15
# ℹ 3 more variables: eventTime <Period>, scientificName <chr>, taxonRank <chr>
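
The date and time columns change type in this step because of the two lubridate parsers used inside set_datetime(). A short sketch of what each parser does on its own, assuming the lubridate package is installed:

```r
library(lubridate)

# dmy() reads day-month-year text and returns a Date
event_date <- dmy("14-01-2023")
event_date
# [1] "2023-01-14"

# hm() reads hours:minutes text and returns a Period
event_time <- hm("10:23")
event_time
# [1] "10H 23M 0S"
```

Parsing the date explicitly avoids the ambiguity between day-first and month-first formats in the raw text.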

Data

Save as occurrences.csv:

use_data(df_dwc)

Metadata

Generate a metadata file

use_metadata_template() # creates the following file:
# Dataset
 
 ## Title
 
 A Sentence Giving Your Dataset Title In Title Case
 
 ## Abstract
 
 A paragraph outlining the content of the dataset
 
 ## Creator
 
 ### Individual name
 
 #### Surname

Metadata

Convert to EML

use_metadata("metadata.Rmd") # creates the following file:
<?xml version="1.0" encoding="UTF-8"?>
 <emlEml xmlns:d="eml://ecoinformatics.org/dataset-2.1.0" xmlns:eml="eml://ecoinformatics.org/eml-2.1.1" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:dc="http://purl.org/dc/terms/" xsi:schemaLocation="eml://ecoinformatics.org/eml-2.1.1 http://rs.gbif.org/schema/eml-gbif-profile/1.3/eml-gbif-profile.xsd" system="R-paperbark-package" scope="system" xml:lang="en">
   <dataset>
     <title>A Sentence Giving Your Dataset Title In Title Case</title>
     <abstract>A paragraph outlining the content of the dataset</abstract>
     <creator>
       <individualName>
         <surname>Person</surname>
         <givenName>Steve</givenName>
         <electronicMailAddress>example@email.com</electronicMailAddress>
       </individualName>
       <organisationName>Put your organisation name here</organisationName>
       <address>
         <deliveryPoint>215 Road Street</deliveryPoint>
         <city>Canberra</city>
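
Comparing the template with this output, each markdown heading appears to become an XML tag with a camelCase name (for example, "Individual name" becomes individualName). The sketch below illustrates that naming pattern only. heading_to_tag() is a hypothetical helper written for this example and is not how delma is implemented.

```r
# Hypothetical helper: turn heading text into a camelCase tag name.
heading_to_tag <- function(heading) {
  words <- tolower(strsplit(trimws(heading), "\\s+")[[1]])
  if (length(words) > 1) {
    rest <- words[-1]
    # Capitalise the first letter of every word after the first
    words[-1] <- paste0(toupper(substring(rest, 1, 1)), substring(rest, 2))
  }
  paste(words, collapse = "")
}

heading_to_tag("Individual name")
# [1] "individualName"
heading_to_tag("Surname")
# [1] "surname"
```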

Archive

Automated process for building the schema file (meta.xml)…

build_archive()
<?xml version="1.0" encoding="UTF-8"?>
 <archive xmlns="http://rs.tdwg.org/dwc/text/" metadata="eml.xml">
   <core encoding="UTF-8" rowType="http://rs.gbif.org/terms/Event" fieldsTerminatedBy="," linesTerminatedBy="&#13;&#10;" fieldsEnclosedBy="&amp;quot;" ignoreHeaderLines="1">
     <files>events.csv</files>
     <id index="0"/>
     <field index="1" term="http://rs.tdwg.org/dwc/terms/eventID"/>
     <field index="2" term="http://rs.tdwg.org/dwc/terms/year"/>
     <field index="3" term="http://rs.tdwg.org/dwc/terms/decimalLatitude"/>
     <field index="4" term="http://rs.tdwg.org/dwc/terms/decimalLongitude"/>
     <field index="5" term="http://rs.tdwg.org/dwc/terms/geodeticDatum"/>
     <field index="6" term="http://rs.tdwg.org/dwc/terms/coordinateUncertaintyInMeters"/>
   </core>
   <extension encoding="UTF-8" rowType="http://rs.tdwg.org/dwc/terms/Occurrence" fieldsTerminatedBy="," linesTerminatedBy="&#13;&#10;" fieldsEnclosedBy="&amp;quot;" ignoreHeaderLines="1">
     <files>occurrences.csv</files>
     <id index="0"/>

Archive

…and zipping the /data-publish folder.

build_archive()
Data (minimum of one)
  • occurrences.csv ✔
  • events.csv      ✖
  • multimedia.csv  ✖
Metadata
  • eml.xml         ✔
Schema
  • meta.xml        ✔
# A tibble: 3 × 4
  filename        compressed_size uncompressed_size timestamp          
  <chr>                     <dbl>             <dbl> <dttm>             
1 occurrences.csv             194               283 2025-07-01 02:34:30
2 eml.xml                     684              1452 2024-12-12 04:21:22
3 meta.xml                    509              2145 2024-12-12 04:21:22

Validate

# validate locally
check_directory() 

# validate via GBIF API
check_archive(username = "a_gbif_user",
              email = "my@email.com",
              password = "a_secure_password")
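
Typing a password directly into a script makes it easy to leak. A minimal sketch of one alternative, reading the credentials from environment variables instead. The variable names GBIF_USER, GBIF_EMAIL and GBIF_PWD and the helper get_gbif_credentials() are assumptions made for this example and are not part of galaxias.

```r
# Hypothetical helper: collect GBIF credentials from environment
# variables (names are assumptions) and fail early if any are unset.
get_gbif_credentials <- function() {
  creds <- list(
    username = Sys.getenv("GBIF_USER"),
    email    = Sys.getenv("GBIF_EMAIL"),
    password = Sys.getenv("GBIF_PWD")
  )
  missing <- names(creds)[!nzchar(unlist(creds))]
  if (length(missing) > 0) {
    stop("Missing GBIF credentials: ", paste(missing, collapse = ", "))
  }
  creds
}

# Intended use (left commented out; needs galaxias and a GBIF account):
# creds <- get_gbif_credentials()
# check_archive(username = creds$username,
#               email    = creds$email,
#               password = creds$password)
```

The variables can be set once per machine, for example in a user-level .Renviron file, so the script itself never contains the password.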

Submitting

Run submit_archive() to create an issue on the data-publication repository

Benefits of galaxias


  • Darwin Core Standard formatting is easy (e.g., .xml)
  • Documentation well-suited to newbies
  • Good integration with existing workflows (e.g., in R or Python)
  • Sharing data is on the priority list (?)

Thank you


Peggy Newman
Martin Westgate
Amanda Buyan
Dax Kellie
Shandiya Balasubramaniam

galaxias
corella
delma
galah